- About the Data
- Importing Libraries
- Reading the Data
- Cleaning the Data
- Exploratory Data Analysis
- Basic Info about Dataset
- Amount of Vaccinated People
- Country wise Daily vaccination
- Percent of Population Vaccinated
- People vaccinated once VS People vaccinated twice
- Most Popular Vaccine Scheme
- Market share of Vaccine Schemes
- TreeMap of Total Vaccinations per country, grouped by Vaccine Scheme
- Visualising on a Map
- Animating the Vaccination Progress
- Conclusion
This Notebook would not have been possible without this Dataset provided by @Gabriel Preda.
The Data contains the following information:
We initialize the Python packages we will use for data ingestion, preparation and visualization. We will use:
- Pandas to read and process the data.
- Plotly for visualization.
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
df = pd.read_csv('country_vaccinations.csv')
df.head()
| | country | iso_code | date | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million | vaccines | source_name | source_website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | ALB | 2021-01-10 | 0.0 | 0.0 | NaN | NaN | NaN | 0.00 | 0.00 | NaN | NaN | Pfizer/BioNTech | Ministry of Health | https://shendetesia.gov.al/covid19-ministria-e... |
| 1 | Albania | ALB | 2021-01-11 | NaN | NaN | NaN | NaN | 64.0 | NaN | NaN | NaN | 22.0 | Pfizer/BioNTech | Ministry of Health | https://shendetesia.gov.al/covid19-ministria-e... |
| 2 | Albania | ALB | 2021-01-12 | 128.0 | 128.0 | NaN | NaN | 64.0 | 0.00 | 0.00 | NaN | 22.0 | Pfizer/BioNTech | Ministry of Health | https://shendetesia.gov.al/covid19-ministria-e... |
| 3 | Albania | ALB | 2021-01-13 | 188.0 | 188.0 | NaN | 60.0 | 63.0 | 0.01 | 0.01 | NaN | 22.0 | Pfizer/BioNTech | Ministry of Health | https://shendetesia.gov.al/covid19-ministria-e... |
| 4 | Albania | ALB | 2021-01-14 | 266.0 | 266.0 | NaN | 78.0 | 66.0 | 0.01 | 0.01 | NaN | 23.0 | Pfizer/BioNTech | Ministry of Health | https://shendetesia.gov.al/covid19-ministria-e... |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4435 entries, 0 to 4434
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   country                              4435 non-null   object
 1   iso_code                             4131 non-null   object
 2   date                                 4435 non-null   object
 3   total_vaccinations                   2916 non-null   float64
 4   people_vaccinated                    2483 non-null   float64
 5   people_fully_vaccinated              1662 non-null   float64
 6   daily_vaccinations_raw               2467 non-null   float64
 7   daily_vaccinations                   4281 non-null   float64
 8   total_vaccinations_per_hundred       2916 non-null   float64
 9   people_vaccinated_per_hundred        2483 non-null   float64
 10  people_fully_vaccinated_per_hundred  1662 non-null   float64
 11  daily_vaccinations_per_million       4281 non-null   float64
 12  vaccines                             4435 non-null   object
 13  source_name                          4435 non-null   object
 14  source_website                       4435 non-null   object
dtypes: float64(9), object(6)
memory usage: 519.9+ KB
As you can see, a lot of null values are present in the data. This may be because:
There may be other reasons but let's move forward with cleaning the data and filling the missing values.
df.isnull().sum()
country                                   0
iso_code                                304
date                                      0
total_vaccinations                     1519
people_vaccinated                      1952
people_fully_vaccinated                2773
daily_vaccinations_raw                 1968
daily_vaccinations                      154
total_vaccinations_per_hundred         1519
people_vaccinated_per_hundred          1952
people_fully_vaccinated_per_hundred    2773
daily_vaccinations_per_million          154
vaccines                                  0
source_name                               0
source_website                            0
dtype: int64
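Raw null counts are easier to compare when expressed as a share of all rows. For instance, the 2773 missing `people_fully_vaccinated` values out of 4435 rows come to roughly 62.5%. The snippet below is a minimal sketch of that calculation on a small made-up frame (`toy` and its values are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature frame that mimics the shape of the vaccination data.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [10.0, None, None, 40.0],
    "people_fully_vaccinated": [None, None, None, 5.0],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing entries.
missing_share = (toy.isnull().mean() * 100).round(1)
print(missing_share.sort_values(ascending=False))
```

The same expression should work unchanged on `df` once the CSV has been loaded.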
df[df['iso_code'].isnull()]['country'].value_counts()
Scotland            76
England             76
Wales               76
Northern Ireland    76
Name: country, dtype: int64
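# Use ~ to invert the boolean mask (unary minus on a boolean Series is not supported by recent pandas/NumPy)
df = df.loc[~df.country.isin(['England', 'Scotland', 'Wales', 'Northern Ireland'])]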
df[df['iso_code'].isnull()]['country'].value_counts()
#The empty Series output shows that all nan values in iso_code are fixed
Series([], Name: country, dtype: int64)
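The filtering step can be sketched in isolation as follows. The frame and its numbers are made up, and the reading that the four nations are dropped because the `United Kingdom` entry already covers them is an assumption, not something the dataset documentation is quoted on here:

```python
import pandas as pd

# Hypothetical rows: two UK nations, the UK aggregate, and an unrelated country.
toy = pd.DataFrame({
    "country": ["England", "United Kingdom", "Wales", "France"],
    "total_vaccinations": [100.0, 130.0, 10.0, 50.0],
})

# Entries without an iso_code that we want to exclude.
sub_national = ["England", "Scotland", "Wales", "Northern Ireland"]

# isin() builds a boolean mask; ~ inverts it so only non-matching rows remain.
kept = toy.loc[~toy["country"].isin(sub_national)]
print(kept["country"].tolist())
```

Names in the exclusion list that do not occur in the frame (here `Scotland` and `Northern Ireland`) are simply ignored by `isin`.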
df.isnull().sum()
country                                   0
iso_code                                  0
date                                      0
total_vaccinations                     1421
people_vaccinated                      1854
people_fully_vaccinated                2663
daily_vaccinations_raw                 1849
daily_vaccinations                      150
total_vaccinations_per_hundred         1421
people_vaccinated_per_hundred          1854
people_fully_vaccinated_per_hundred    2663
daily_vaccinations_per_million          150
vaccines                                  0
source_name                               0
source_website                            0
dtype: int64
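If the remaining gaps in cumulative columns such as `total_vaccinations` ever need to be filled, one possible approach is a forward fill within each country, so that a value is never carried over from one country to the next. This is only a sketch on made-up data (`toy` and the `total_filled` column are hypothetical names), not a step this notebook performs:

```python
import pandas as pd

# Hypothetical data: country A has a gap in the middle, country B at the start.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"],
    "total_vaccinations": [10.0, None, 30.0, None, 5.0],
})

toy = toy.sort_values(["country", "date"])
# Forward-fill inside each country so values never leak across countries.
toy["total_filled"] = toy.groupby("country")["total_vaccinations"].ffill()
print(toy)
```

Country B's first row stays NaN because there is no earlier value for that country to carry forward, which is usually preferable to borrowing a number from country A.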
df.columns
Index(['country', 'iso_code', 'date', 'total_vaccinations',
'people_vaccinated', 'people_fully_vaccinated',
'daily_vaccinations_raw', 'daily_vaccinations',
'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
'people_fully_vaccinated_per_hundred', 'daily_vaccinations_per_million',
'vaccines', 'source_name', 'source_website'],
dtype='object')
- total_vaccinations counts the doses administered; it is split into people_vaccinated and people_fully_vaccinated, and any NaN can be ignored.
- people_fully_vaccinated gives the number of people who have taken the vaccine twice. Any NaN in this column can be ignored.
- total_vaccinations_per_hundred is the percentage of total vaccinations (i.e. total_vaccinations divided by the population of the country, times 100). Let us rename this.
- people_fully_vaccinated_per_hundred is the percentage of fully vaccinated people (i.e. people_fully_vaccinated divided by the population of the country, times 100). Let us rename this.
- daily_vaccinations_per_million is the number of daily vaccinations per million inhabitants of the country. This doesn't have inconsistencies.
df.rename(columns = {'total_vaccinations_per_hundred':'total_vaccinations_percent',
'people_fully_vaccinated_per_hundred':'people_fully_vaccinated_percent',
'people_vaccinated_per_hundred':'people_vaccinated_percent'}, inplace=True)
df.columns
Index(['country', 'iso_code', 'date', 'total_vaccinations',
'people_vaccinated', 'people_fully_vaccinated',
'daily_vaccinations_raw', 'daily_vaccinations',
'total_vaccinations_percent', 'people_vaccinated_percent',
'people_fully_vaccinated_percent', 'daily_vaccinations_per_million',
'vaccines', 'source_name', 'source_website'],
dtype='object')
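To make the meaning of the per-hundred (percent) columns concrete, the relationship can be checked with simple arithmetic. The two figures below are taken from the United States row of the country summary shown later in this notebook; the helper `per_hundred` is a hypothetical name introduced only for illustration:

```python
# Figures from the United States row of the country summary in this notebook.
total_vaccinations = 72_806_180
total_vaccinations_percent = 21.77

# per_hundred = count / population * 100, so population = count / per_hundred * 100.
implied_population = total_vaccinations / total_vaccinations_percent * 100
print(f"Implied population: {implied_population:,.0f}")


def per_hundred(count: float, population: float) -> float:
    """Hypothetical helper: doses (or people) per 100 inhabitants."""
    return round(count / population * 100, 2)


# Round trip: recomputing the percentage from the implied population.
print(per_hundred(total_vaccinations, implied_population))
```

Because the numerator counts doses rather than distinct people, this percentage can exceed 100 for places that have administered more doses than they have inhabitants.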
print('Data point starts from:',df.date.min(),'\n')
print('Data point ends at:',df.date.max(),'\n')
print('Total no of Countries in the data set:',len(df.country.unique()),'\n')
print('Total no of unique Vaccine Schemes in the data set:',len(df.vaccines.unique()),'\n')
Data point starts from: 2020-12-08
Data point ends at: 2021-02-27
Total no of Countries in the data set: 108
Total no of unique Vaccine Schemes in the data set: 20
# All the different countries
df.country.unique()
array(['Albania', 'Algeria', 'Andorra', 'Anguilla', 'Argentina',
'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh',
'Barbados', 'Belarus', 'Belgium', 'Bermuda', 'Bolivia', 'Brazil',
'Bulgaria', 'Cambodia', 'Canada', 'Cayman Islands', 'Chile',
'China', 'Colombia', 'Costa Rica', 'Croatia', 'Cyprus', 'Czechia',
'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
'Estonia', 'Faeroe Islands', 'Falkland Islands', 'Finland',
'France', 'Germany', 'Gibraltar', 'Greece', 'Greenland',
'Guernsey', 'Guyana', 'Hungary', 'Iceland', 'India', 'Indonesia',
'Iran', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Japan',
'Jersey', 'Kazakhstan', 'Kuwait', 'Latvia', 'Lebanon',
'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Maldives',
'Malta', 'Mauritius', 'Mexico', 'Monaco', 'Montenegro', 'Morocco',
'Myanmar', 'Nepal', 'Netherlands', 'New Zealand',
'Northern Cyprus', 'Norway', 'Oman', 'Pakistan', 'Panama',
'Paraguay', 'Peru', 'Poland', 'Portugal', 'Qatar', 'Romania',
'Russia', 'Saint Helena', 'Saudi Arabia', 'Senegal', 'Serbia',
'Seychelles', 'Singapore', 'Slovakia', 'Slovenia', 'South Africa',
'South Korea', 'Spain', 'Sri Lanka', 'Sweden', 'Switzerland',
'Trinidad and Tobago', 'Turkey', 'Turks and Caicos Islands',
'Ukraine', 'United Arab Emirates', 'United Kingdom',
'United States', 'Venezuela', 'Zimbabwe'], dtype=object)
# All the different kinds of vaccines
df.vaccines.unique()
array(['Pfizer/BioNTech', 'Sputnik V', 'Oxford/AstraZeneca',
'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech',
'Oxford/AstraZeneca, Sputnik V',
'Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V',
'Oxford/AstraZeneca, Sinovac', 'Sinopharm/Beijing',
'Moderna, Pfizer/BioNTech', 'Pfizer/BioNTech, Sinovac',
'Sinopharm/Beijing, Sinopharm/Wuhan, Sinovac',
'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V',
'Covaxin, Oxford/AstraZeneca', 'Sinovac',
'Oxford/AstraZeneca, Pfizer/BioNTech',
'Oxford/AstraZeneca, Pfizer/BioNTech, Sputnik V',
'Oxford/AstraZeneca, Sinopharm/Beijing',
'Oxford/AstraZeneca, Sinopharm/Beijing, Sputnik V',
'Johnson&Johnson',
'Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sinopharm/Wuhan, Sputnik V'],
dtype=object)
# Here we create `country_data`, which stores basic info about each country: the vaccine scheme it uses, its total
# vaccinations, and those vaccinations as a percentage of its population
country_data = df.copy()
cols = ['country', 'total_vaccinations', 'iso_code', 'vaccines', 'total_vaccinations_percent']
country_data = country_data[cols].groupby('country').max().sort_values('total_vaccinations', ascending=False)
country_data.reset_index(inplace = True)
country_data.columns = ['Country', 'Total Vaccinations', 'iso_code', 'Vaccines', 'Total Vaccinations Percentage']
country_data
| | Country | Total Vaccinations | iso_code | Vaccines | Total Vaccinations Percentage |
|---|---|---|---|---|---|
| 0 | United States | 72806180.0 | USA | Moderna, Pfizer/BioNTech | 21.77 |
| 1 | China | 40520000.0 | CHN | Sinopharm/Beijing, Sinopharm/Wuhan, Sinovac | 2.82 |
| 2 | United Kingdom | 20450858.0 | GBR | Oxford/AstraZeneca, Pfizer/BioNTech | 30.13 |
| 3 | India | 14242547.0 | IND | Covaxin, Oxford/AstraZeneca | 1.03 |
| 4 | Turkey | 8514775.0 | TUR | Sinovac | 10.10 |
| ... | ... | ... | ... | ... | ... |
| 103 | New Zealand | 1000.0 | NZL | Pfizer/BioNTech | 0.02 |
| 104 | Paraguay | 1000.0 | PRY | Sputnik V | 0.01 |
| 105 | Trinidad and Tobago | 440.0 | TTO | Oxford/AstraZeneca | 0.03 |
| 106 | Venezuela | 157.0 | VEN | Sputnik V | 0.00 |
| 107 | Saint Helena | 107.0 | SHN | Oxford/AstraZeneca | 1.76 |
108 rows × 5 columns
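Taking `max()` over every column works for cumulative numbers, but for text columns such as `vaccines` it returns the alphabetically largest string, which need not be the most recent scheme. A hedged alternative is to name the aggregation per column. The frame below is made up for illustration, and choosing `last` for the scheme is an assumption about what is wanted:

```python
import pandas as pd

# Hypothetical data: country A changes its vaccine scheme on the second day.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "total_vaccinations": [10.0, 30.0, 5.0, None],
    "vaccines": ["Sputnik V", "Oxford/AstraZeneca, Sputnik V", "Sinovac", "Sinovac"],
})

summary = (
    toy.sort_values("date")
       .groupby("country")
       .agg(total_vaccinations=("total_vaccinations", "max"),  # cumulative, so max is the latest
            vaccines=("vaccines", "last"))                      # most recent scheme by date
       .sort_values("total_vaccinations", ascending=False)
       .reset_index()
)
print(summary)
```

With a plain `max()`, country A would be reported as using only `Sputnik V`, because that string sorts after `Oxford/AstraZeneca, Sputnik V`.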
fig = px.bar(country_data[:50], x = 'Country', y = 'Total Vaccinations', color = 'Total Vaccinations')
fig.update_layout(title = dict(text = 'Vaccinations World-Wide Comparison', x=0.5, y=0.95))
fig.update_xaxes(title = 'Countries', title_font = dict(size=18, family='Courier', color='crimson'), tickangle=-90)
fig.update_yaxes(title = 'Total Vaccinations', title_font = dict(size=18, family='Courier', color='crimson'))
fig.show()
From the plot, some interesting facts stand out:
top_countries = ['USA','CHN','GBR','IND','ISR','ARE','BRA','DEU','TUR','ITA','FRA']
fig = px.line(df[df.iso_code.isin(top_countries)], x='date', y='daily_vaccinations', color='country')
fig.update_layout(title = dict(text = 'World-Wide Daily Vaccination Timeline', x=0.5, y=0.95),
legend = dict(title = 'Country', traceorder = 'reversed'))
fig.update_xaxes(title = 'Timeline', title_font = dict(size=18, family='Courier', color='crimson'))
fig.update_yaxes(title = 'Daily Vaccinations', title_font = dict(size=18, family='Courier', color='crimson'))
fig.show()
From the plot, we can deduce:
top_country_data = country_data.sort_values('Total Vaccinations Percentage', ascending=False)[:40]
fig = px.bar(top_country_data, x = 'Country', y = 'Total Vaccinations Percentage', color = 'Total Vaccinations Percentage')
fig.update_layout(title = dict(text = 'Percentage of Vaccinated Population World-Wide Comparison', x=0.5, y=0.95))
fig.update_xaxes(title = 'Countries', title_font = dict(size=18, family='Courier', color='crimson'), tickangle=-90)
fig.update_yaxes(title = 'Percentage of Population Vaccinated', title_font = dict(size=14, family='Courier', color='crimson'))
fig.show()
106.5%. This means that Gibraltar has already completed Phase One of vaccination for its population. Note that this figure can exceed 100% because it is based on total_vaccinations, which counts doses rather than distinct people.
# group the df by the date and calculate the sum
vaccinated_df = df.copy()
vaccinated_df = vaccinated_df.groupby('date')[['people_fully_vaccinated', 'people_vaccinated']].sum()
# reset_index pops out the date index and makes a date column in its place; then sort by date
vaccinated_df.reset_index(inplace = True)
vaccinated_df = vaccinated_df.sort_values('date')
# plot the values
plot = go.Figure(data=[
go.Scatter(
x = vaccinated_df['date'],
y = vaccinated_df['people_vaccinated'],
stackgroup='two',
name = 'People Vaccinated once',
marker_color= '#35eb28'),
go.Scatter(
x = vaccinated_df['date'],
y = vaccinated_df['people_fully_vaccinated'],
stackgroup='one',
name = 'People Vaccinated twice',
marker_color= '#c4eb28')
])
plot.update_layout(title = dict(text= 'People vaccinated once vs Fully vaccinated till date', x = 0.5, y = 0.95))
plot.update_layout(legend = dict(orientation = "h", yanchor = "bottom", y = 1.02, xanchor = "right", x = 1))
plot.update_xaxes(title = 'Timeline', title_font = dict(size=18, family='Courier', color='crimson'))
plot.update_yaxes(title = 'Amount of vaccinated people', title_font = dict(size=18, family='Courier', color='crimson'))
plot.show()
From the plot we can determine:
- The People Vaccinated line ascends gradually but shows peculiar dips in the early days of daily vaccinations. These dips fall on Wednesdays and weekends, which suggests that fewer vaccinations are carried out or reported on those days.
- The People Vaccinated fully line also ascends gradually, but far more slowly than the People Vaccinated line.
- The People Vaccinated line starts to lift from Dec 19 and the People Vaccinated fully line starts to lift from Jan 9. This is because of a gap of 3 weeks between the first and second dose of the vaccine.
top_vaccine = country_data.copy().groupby('Vaccines').sum().sort_values(by = 'Total Vaccinations',ascending = False)
top_vaccine.reset_index(inplace = True)
fig = px.bar(top_vaccine, x = 'Vaccines', y = 'Total Vaccinations',
color = 'Vaccines', color_discrete_sequence = px.colors.sequential.Rainbow_r)
fig.update_layout(height= 575, title = dict(text = 'Total Vaccine per Scheme', x=0.5, y=0.95),
legend_title='Types of Vaccine Scheme')
fig.update_xaxes(title = 'Vaccines', title_font = dict(size=18, family='Courier', color='crimson'), showticklabels = False)
fig.update_yaxes(title = 'Amount of vaccinated people', title_font = dict(size=18, family='Courier', color='crimson'))
fig.show()
world_wide_vaccine_use = df.copy().vaccines.value_counts().to_dict()
vaccine_type = {}
for scheme, value in world_wide_vaccine_use.items():
    for name in scheme.split(','):
        vaccine_type[name.strip()] = vaccine_type.get(name.strip(), 0) + value
fig = px.pie(values = list(vaccine_type.values()), names = list(vaccine_type.keys()),
color_discrete_sequence = px.colors.sequential.Sunset_r)
fig.update_layout(title = dict(text = 'Market share of Vaccines', x=0.5, y=0.95),
legend_title='Types of Vaccines')
fig.show()
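The loop above counts how many data rows mention each vaccine, so countries with more reporting days weigh more heavily. As a hedged alternative, the sketch below counts how many countries use each vaccine, using pandas `str.split` and `explode`. The three rows are copied from the country summary earlier in this notebook; the names `summary`, `exploded` and `countries_per_vaccine` are introduced here for illustration only:

```python
import pandas as pd

# Three rows copied from the country summary shown earlier in this notebook.
summary = pd.DataFrame({
    "Country": ["United States", "United Kingdom", "Turkey"],
    "Vaccines": ["Moderna, Pfizer/BioNTech",
                 "Oxford/AstraZeneca, Pfizer/BioNTech",
                 "Sinovac"],
    "Total Vaccinations": [72806180.0, 20450858.0, 8514775.0],
})

# Split each scheme into a list of vaccines, then give every vaccine its own row.
exploded = summary.assign(Vaccine=summary["Vaccines"].str.split(", ")).explode("Vaccine")

# Number of distinct countries that list each vaccine in their scheme.
countries_per_vaccine = (
    exploded.groupby("Vaccine")["Country"].nunique().sort_values(ascending=False)
)
print(countries_per_vaccine)
```

Neither measure is a true market share in doses, since the columns shown in this dataset do not break vaccinations down by vaccine within a scheme.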
From the above charts, we can see that:
fig = px.treemap(country_data, path = ['Vaccines', 'Country'], values = 'Total Vaccinations', height = 650,
custom_data = ['Country', 'Vaccines', 'Total Vaccinations'])
fig.update_layout(title = dict(text = 'Total vaccinations per country, grouped by Vaccine Scheme', x=0.5, y=0.95))
fig.update_traces(hovertemplate = 'Country: %{customdata[0]}<br>Vaccine: %{customdata[1]}<br>Total Vaccinations: %{customdata[2]}')
fig.show()
fig = px.choropleth(country_data, locations = 'Country', color = 'Vaccines',
hover_data = ['Country', 'Vaccines'], locationmode = "country names",
projection = 'natural earth')
fig.update_layout(legend = dict(title = "Vaccine Scheme", orientation = "h", y=-0.1), showlegend = False)
fig.update_layout(title = dict(text = 'Countries using same Vaccine Scheme', x=0.5, y=0.95),
geo = dict(showocean = True, oceancolor = "#7af8ff", showland = True,
landcolor = "white", showlakes = False, showframe = False))
fig.show()
fig = px.choropleth(country_data, locations = 'Country', color = 'Total Vaccinations',
locationmode = 'country names', color_continuous_scale = 'rainbow',
hover_name = 'Country', projection = 'natural earth')
fig.update_layout(title = dict(text = 'Total Vaccinations in every Country', x=0.5, y=0.95),
geo = dict(showocean = True, oceancolor = "#7af8ff", showland = True,
landcolor = "white",showlakes = False, showframe = False))
fig.show()
# Let's create a copy of the df, back-fill each NaN with the next valid value and sort the values according to date
time_df = df.copy()
time_df = time_df.bfill()
time_df['date'] = pd.to_datetime(time_df['date'])
time_df = time_df.sort_values('date', ascending=True)
time_df['date'] = time_df['date'].dt.strftime('%m-%d-%Y')
fig = px.choropleth(time_df, locations = 'country', color = 'daily_vaccinations',
locationmode = 'country names', hover_name = 'country',
animation_frame = 'date', projection = 'natural earth',
color_continuous_scale = px.colors.sequential.Plasma)
fig.update_layout(title = dict(text = 'Daily Vaccinations World-Wide Timeline', x=0.47, y=0.95),
geo = dict(showocean = True, oceancolor = "#7af8ff", showland = True,
landcolor = "white", showlakes = False, showframe = False))
fig.show()
13-12-2020 to 27-2-2021)
From our analysis we can conclude the following:
With this information we can see that the Covid-19 vaccination drive is progressing at an incredible pace. Diseases like polio and smallpox, which caused many deaths, took decades or even centuries to bring under control. At the current pace, Covid-19 could potentially be eradicated within about 2 years.